[VL][TEST] Use Velox's HashTableCache to cache the BHJ's HashTable #12163
Open
JkSelf wants to merge 2 commits into
Run Gluten Clickhouse CI on x86
63a6382 to 10aec74
Run Gluten Clickhouse CI on x86
Pull request overview
This PR wires Velox’s HashTableCache into Gluten’s Velox broadcast hash join (BHJ) path by stabilizing the build-side hash table identifier across AQE wrappers and propagating Spark execution metadata down into the native runtime to support cache scoping/reuse.
Changes:
- Canonicalize BHJ build hash table IDs across BroadcastQueryStageExec/ReusedExchangeExec so reuse paths share a stable cache key.
- Add Spark SQL execution id propagation (Java → JNI → native) and use it in Velox task/query identifiers.
- Integrate Velox HashTableCache injection/drop in native BHJ build/cleanup, and update Velox-side build relation/cache call sites accordingly.
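The first and third changes can be pictured with a small, self-contained model. This is only a sketch under stated assumptions: `PlanNode`, `NodeKind`, `canonicalBuildHashTableId`, and `ToyHashTableCache` are hypothetical names invented for illustration, and none of them is the Gluten, Spark, or Velox API. The idea shown is that peeling the AQE wrappers down to the underlying broadcast exchange yields one key, so every reuse path looks up the same cached table, and cleanup drops it by that same key.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Toy stand-ins for Spark plan nodes (hypothetical, not Spark/Gluten classes).
enum class NodeKind { kBroadcastExchange, kBroadcastQueryStage, kReusedExchange };

struct PlanNode {
  NodeKind kind;
  std::string id;                   // per-node identifier
  std::shared_ptr<PlanNode> child;  // wrapped node; null for the exchange itself
};

// Peel wrapper nodes until the underlying broadcast exchange is reached and
// use that node's id as the cache key. A chain with no exchange at the bottom
// falls back to the id of its innermost node.
std::string canonicalBuildHashTableId(const std::shared_ptr<PlanNode>& node) {
  const PlanNode* cur = node.get();
  while (cur->kind != NodeKind::kBroadcastExchange && cur->child) {
    cur = cur->child.get();
  }
  return cur->id;
}

// Toy payload and cache (hypothetical; Velox's HashTableCache differs).
struct ToyHashTable {
  int numRows;
};

class ToyHashTableCache {
 public:
  // Returns true if the table was inserted, false if the key already existed.
  bool put(const std::string& key, std::shared_ptr<ToyHashTable> table) {
    return tables_.emplace(key, std::move(table)).second;
  }

  // Returns the cached table, or nullptr when the key is absent.
  std::shared_ptr<ToyHashTable> get(const std::string& key) const {
    auto it = tables_.find(key);
    return it == tables_.end() ? nullptr : it->second;
  }

  // Removes the entry; returns true if something was dropped.
  bool drop(const std::string& key) {
    return tables_.erase(key) > 0;
  }

 private:
  std::unordered_map<std::string, std::shared_ptr<ToyHashTable>> tables_;
};
```

In this model, a query stage node and a reused exchange node that wrap the same exchange both resolve to the exchange's id, so a table stored through one path is found through the other.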
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| gluten-ut/spark35/src/test/scala/org/apache/spark/sql/execution/adaptive/velox/VeloxAdaptiveQueryExecSuite.scala | Adjusts AQE test conf to avoid an optimizer rule impacting reuse scenarios. |
| gluten-substrait/src/main/scala/org/apache/gluten/execution/JoinExecTransformer.scala | Introduces canonical build hash table ID derivation across AQE wrappers and uses it in join parameters. |
| gluten-arrow/src/main/java/org/apache/gluten/vectorized/PlanEvaluatorJniWrapper.java | Extends JNI kernel creation signature to include Spark execution id. |
| gluten-arrow/src/main/java/org/apache/gluten/vectorized/NativePlanEvaluator.java | Extracts Spark execution id from task local properties and passes it to JNI. |
| ep/build-velox/src/get-velox.sh | Changes default Velox repo/branch selection for builds. |
| cpp/velox/substrait/SubstraitToVeloxPlan.cc | Updates BHJ plan construction to align with Velox-side hash table caching behavior. |
| cpp/velox/jni/VeloxJniWrapper.cc | Injects built hash tables into Velox HashTableCache and updates clone/clear JNI APIs to use cache keys. |
| cpp/velox/compute/WholeStageResultIterator.cc | Uses Spark execution id for Velox task/query identification. |
| cpp/core/jni/JniWrapper.cc | Propagates execution id into native SparkTaskInfo. |
| cpp/core/compute/Runtime.h | Extends SparkTaskInfo with execution id and updates formatting. |
| backends-velox/src/main/scala/org/apache/spark/sql/execution/unsafe/UnsafeColumnarBuildSideRelation.scala | Updates hash table “clone” call to pass cache key. |
| backends-velox/src/main/scala/org/apache/spark/sql/execution/ColumnarBuildSideRelation.scala | Updates hash table “clone” call to pass cache key. |
| backends-velox/src/main/scala/org/apache/gluten/execution/VeloxBroadcastBuildSideCache.scala | Updates hash table clear to drop by cache key. |
| backends-velox/src/main/scala/org/apache/gluten/execution/HashJoinExecTransformer.scala | Uses canonical build hash table ID for broadcast-table resource tracking and reuse. |
| backends-velox/src/main/java/org/apache/gluten/vectorized/HashJoinBuilder.java | Updates native API signatures to include cache key for clone/clear. |
Comment on lines 123 to 129
    std::unordered_set<velox::core::PlanNodeId> emptySet;
    velox::core::PlanFragment planFragment{planNode, velox::core::ExecutionStrategy::kUngrouped, 1, emptySet};
    std::shared_ptr<velox::core::QueryCtx> queryCtx = createNewVeloxQueryCtx();
    task_ = velox::exec::Task::create(
        fmt::format(
            "Gluten_Stage_{}_TID_{}_VTID_{}",
            std::to_string(taskInfo_.stageId),
            std::to_string(taskInfo_.taskId),
            std::to_string(taskInfo.vId)),
        getVeloxTaskId(taskInfo_),
        std::move(planFragment),
        0,
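The snippet above shows both the old `fmt::format` task id and the call to `getVeloxTaskId(taskInfo_)` that replaces it, but this page does not show the body of that helper. As a rough sketch only, a helper that folds the Spark execution id into the task id could look like the following. The struct `ToyTaskInfo`, the function `makeTaskId`, the `Exec_` prefix, and the use of a negative value for "no execution id" are all assumptions made for illustration and are not taken from the PR.

```cpp
#include <cstdint>
#include <string>

// Hypothetical mirror of the task metadata; field names other than stageId,
// taskId and vId are assumptions.
struct ToyTaskInfo {
  int32_t stageId;
  int64_t taskId;
  int64_t vId;
  int64_t executionId;  // assumed convention: negative means "not available"
};

// Builds a task id that keeps the existing Stage/TID/VTID layout and, when an
// execution id is known, prefixes it so ids from different SQL executions
// cannot collide.
std::string makeTaskId(const ToyTaskInfo& info) {
  std::string id = "Gluten_";
  if (info.executionId >= 0) {
    id += "Exec_" + std::to_string(info.executionId) + "_";
  }
  id += "Stage_" + std::to_string(info.stageId);
  id += "_TID_" + std::to_string(info.taskId);
  id += "_VTID_" + std::to_string(info.vId);
  return id;
}
```

With these assumptions, `makeTaskId({1, 42, 0, 5})` yields `Gluten_Exec_5_Stage_1_TID_42_VTID_0`, and the same call with a negative execution id falls back to the original `Gluten_Stage_1_TID_42_VTID_0` form.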
Comment on lines 19 to +22
    CURRENT_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd)
    VELOX_REPO=https://github.com/IBM/velox.git
    VELOX_BRANCH=dft-2026_06_04
    VELOX_ENHANCED_BRANCH=ibm-2026_06_04
    VELOX_REPO=https://github.com/JkSelf/velox.git
    VELOX_BRANCH=dft-2026_06_04-hashtable-cache
    VELOX_ENHANCED_BRANCH=ibm-2026_06_04-hashtable-cache
10aec74 to 581f7e5
Run Gluten Clickhouse CI on x86
What changes are proposed in this pull request?
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?